Final Project: Data Cleaning and Simulations!!¶
Welcome to our final project our team is **Zane, Kenadi, and AJ** !!
We wanted to study what factors affect song popularity and wanted to run a simulation to predict popularity based on these factors. Before we start this was our main hypothesis.
We looked at the data and saw many different measures such as available markets, duration, and spotify metrics (loudness, liveness, danceability, energy, acousticness, valence) Valence is the general positive or negative feel of the song (0.0 to 1.0)!!
Ex) a song of 0 valence is very negative -- a song of 1 valence is very happy sounding.
Hypothesis: We predict more popular artists will have a larger fan bases and will be more popular. We also predict that Spotify's danceability/energy predictors are positively correlated the popularity, yet this is influenced by other data such as marketing and genre which is not found in our data set.¶
Here is our cleaning of the data
#!pip install --upgrade altair
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv('spotify_data_1986_2023.csv') #import dataset
print(df.columns)
Index(['Unnamed: 0.1', 'Unnamed: 0', 'track_id', 'track_name', 'popularity',
'available_markets', 'disc_number', 'duration_ms', 'explicit',
'track_number', 'href', 'album_id', 'album_name', 'album_release_date',
'album_type', 'album_total_tracks', 'artists_names', 'artists_ids',
'principal_artist_id', 'principal_artist_name', 'artist_genres',
'principal_artist_followers', 'acousticness', 'analysis_url',
'danceability', 'energy', 'instrumentalness', 'key', 'liveness',
'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence',
'year', 'duration_min'],
dtype='object')
Remove Irrelevant Columns¶
dropping = ['Unnamed: 0.1', 'Unnamed: 0', 'href', 'artists_ids', 'principal_artist_id', 'analysis_url', 'album_id'] #drop irrelevant columns
df = df.drop(dropping, axis = 1)
df
| track_id | track_name | popularity | available_markets | disc_number | duration_ms | explicit | track_number | album_name | album_release_date | ... | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | year | duration_min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2A6yzRGMgSQCUapR2ptm6A | True Colors | 73 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 227600 | False | 4 | True Colors | 1986-10-14 00:00:00 | ... | 0.0 | 0.1160 | -13.155 | 1.0 | 0.0331 | 171.282 | 4.0 | 0.268 | 1986 | 3.793333 |
| 1 | 3gKwVWwKmeuFtPubICbOGc | Paul Revere | 61 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 220840 | False | 9 | Licensed To Ill | 1986-11-15 00:00:00 | ... | 6.0 | 0.2280 | -13.232 | 0.0 | 0.9440 | 91.753 | 4.0 | 0.578 | 1986 | 3.680667 |
| 2 | 2tY1gxCKslfXLFpFofYmJQ | Brass Monkey | 68 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 157440 | False | 11 | Licensed To Ill | 1986-11-15 00:00:00 | ... | 6.0 | 0.0992 | -8.435 | 0.0 | 0.3300 | 116.371 | 4.0 | 0.655 | 1986 | 2.624000 |
| 3 | 31dqpLUModJWNbxrXu6TWd | Shot in the Dark | 66 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 256293 | False | 9 | The Ultimate Sin | 1986-02-22 00:00:00 | ... | 6.0 | 0.1420 | -7.262 | 1.0 | 0.0324 | 123.887 | 3.0 | 0.619 | 1986 | 4.271550 |
| 4 | 00vYs0qZA40Z8AAaN7xmMO | Manic Monday | 63 | AE;BH;EG;GB;IE;IQ;JO;KW;LB;LY;OM;PS;QA;SA;US | 1 | 184160 | False | 1 | Different Light (Expanded Edition) | 1986-01-01 00:00:00 | ... | 2.0 | 0.3580 | -10.906 | 1.0 | 0.0289 | 121.761 | 4.0 | 0.853 | 1986 | 3.069333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11445 | 4nrPB8O7Y7wsOCJdgXkthe | Shakira: Bzrp Music Sessions, Vol. 53 | 89 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 214945 | False | 1 | Shakira: Bzrp Music Sessions, Vol. 53 | 2023-01-11 00:00:00 | ... | 2.0 | 0.0915 | -5.600 | 0.0 | 0.0493 | 122.104 | 4.0 | 0.498 | 2023 | 3.582417 |
| 11446 | 7Lkxvfl2rkNYWS4kBDCQtN | Las Morras | 81 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 155550 | True | 1 | Las Morras | 2023-04-04 00:00:00 | ... | 9.0 | 0.0706 | -4.641 | 0.0 | 0.0407 | 133.007 | 3.0 | 0.901 | 2023 | 2.592500 |
| 11447 | 6UoKX6uLJwhsnyTp5k5StP | The Painter | 75 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 211413 | False | 1 | The Painter | 2023-08-11 00:00:00 | ... | 2.0 | 0.1070 | -5.766 | 1.0 | 0.0329 | 97.981 | 4.0 | 0.261 | 2023 | 3.523550 |
| 11448 | 4ZYAU4A2YBtlNdqOUtc7T2 | Red Ruby Da Sleeze | 78 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 214445 | True | 1 | Red Ruby Da Sleeze | 2023-03-03 00:00:00 | ... | 1.0 | 0.1110 | -6.181 | 1.0 | 0.2560 | 98.355 | 4.0 | 0.292 | 2023 | 3.574083 |
| 11449 | 5vZoQQ1hH5L2s4Y8G86ksg | Angels (Don’t Always Have Wings) | 74 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 200426 | False | 1 | Angels (Don’t Always Have Wings) | 2023-01-20 00:00:00 | ... | 1.0 | 0.0913 | -4.257 | 1.0 | 0.0283 | 66.041 | 3.0 | 0.230 | 2023 | 3.340433 |
11450 rows × 30 columns
Look at the amount of letters and amount of words in the track name¶
First let's drop the duplicates¶
df.drop_duplicates(subset='track_name', inplace=True) #first drop duplicates
df[df['track_name'] == "Gonna Make You Sweat (Everybody Dance Now) (feat. Freedom Williams)"] #check it worked, yay it did!
| track_id | track_name | popularity | available_markets | disc_number | duration_ms | explicit | track_number | album_name | album_release_date | ... | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | year | duration_min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1246 | 7eheoW4hxrH51ww2QePVwF | Gonna Make You Sweat (Everybody Dance Now) (fe... | 71 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 244093 | False | 1 | Super Hits | 1990-01-01 00:00:00 | ... | 6.0 | 0.183 | -7.405 | 1.0 | 0.0935 | 112.187 | 4.0 | 0.794 | 1990 | 4.068217 |
1 rows × 30 columns
Now let's start looking at the names, there are some special cases to see that we don't want counted when we sum the letters¶
special_cases = df[
df['track_name'].str.contains(r'\([A-Za-z0-9 ].*\)$', regex=True) | #has parentheses
df['track_name'].str.contains(r'-\s*[A-Za-z].*$', regex=True) | #has a dash
df['track_name'].str.contains(r':\s*[A-Za-z].*$', regex=True) #has a colon
]
# Show only the track_name column
print(special_cases['track_name'])
print(len(special_cases))
26 Battery (Remastered)
27 Master of Puppets (Remastered)
31 Only You (And You Alone)
34 Holding Out for a Hero - From "Footloose" Soun...
47 Walk This Way (feat. Aerosmith)
...
11429 Pull Up (feat. 21 Savage)
11430 MONTAGEM - PR FUNK
11444 AMERICA HAS A PROBLEM (feat. Kendrick Lamar)
11445 Shakira: Bzrp Music Sessions, Vol. 53
11449 Angels (Don’t Always Have Wings)
Name: track_name, Length: 1696, dtype: object
1696
So it looks like there are 1696 cases, wow that is a lot!¶
Let's try to remove all of the irrelevant ones while keeping things like "(Live)" or parts of the title like "Only You (And You Alone)"¶
### clean track names
import re
df['clean_track_name'] = df['track_name'].str.replace(
r'\s*(\((?:feat\.?|featuring|with)[^)]*\)|\([^)]*(revisited|remaster(?:ed)?)[^)]*\)|[-–—].*|:.*)$',
'',
regex=True,
flags=re.IGNORECASE, #replaces my key words regardless of if it is capital or lowercase
).str.strip()
# note for str.replace:
# \s* → optional spaces before extra info
# \( ... \) → matches parentheses
# (?:feat\.?|featuring|with) → only removes parentheses if they start with "feat", "feat.", or "featuring", or "with"
# [^)]* → match everything inside the parentheses until the closing parenthesis
# \([^)]*(revisited|remaster(?:ed)?)[^)]* → removes parentheses containing "revisited", "remaster", or "remastered" anywhere inside
# [-–—].* → match a dash (-), en dash (–), or em dash (—) and everything after it (at the end)
# :.* → match a colon and everything after it (at the end)
# $ → only match at the end of the string
# flags=re.IGNORECASE → makes regex case-insensitive (so "Remastered" is matched)
# .str.strip() → removes extra spaces left at the beginning or end
#move clean tarck name so it is next to track name
clean_col = df.pop('clean_track_name')
df.insert(2, 'clean_track_name', clean_col)
### Count letters can make new column!
# Count letters + numbers (all languages)
letter_num_count = df['clean_track_name'].apply(lambda x: sum(c.isalnum() for c in str(x)))
# Remove existing column if it already is there - this will let me edit the code above without it giving me an error that the column already exists
if 'letters_track_name' in df.columns:
df.drop(columns='letters_track_name', inplace=True)
# Insert next to clean track name column so I can see them side by side
df.insert(3, 'letters_track_name', letter_num_count)
#look at it to check :)
pd.set_option('display.max_rows', 50)
df[['track_name', 'clean_track_name', "letters_track_name"]].sample(50, random_state=42) #show me a random 50 so I can check my work
| track_name | clean_track_name | letters_track_name | |
|---|---|---|---|
| 5634 | A Boy Brushed Red Living In Black And White | A Boy Brushed Red Living In Black And White | 35 |
| 9756 | Unbothered | Unbothered | 10 |
| 1348 | Is She Weird | Is She Weird | 10 |
| 4171 | Unpretty | Unpretty | 8 |
| 1893 | Black or White | Black or White | 12 |
| 4594 | Yo No Soy Esa Mujer | Yo No Soy Esa Mujer | 15 |
| 3820 | Quizás Si, Quizás No | Quizás Si, Quizás No | 16 |
| 8452 | r - Cali | r | 1 |
| 11004 | No Se Va - EN VIVO | No Se Va | 6 |
| 6809 | 二十歳の恋 | 二十歳の恋 | 5 |
| 8955 | Delusions of Saviour | Delusions of Saviour | 18 |
| 33 | Talk Dirty To Me | Talk Dirty To Me | 13 |
| 11127 | PAINTING PICTURES | PAINTING PICTURES | 16 |
| 3383 | Jump On It | Jump On It | 8 |
| 7498 | The Resistance | The Resistance | 13 |
| 2651 | Y Es Que La Quiero | Y Es Que La Quiero | 14 |
| 5266 | Such Great Heights - Remastered | Such Great Heights | 16 |
| 7166 | Evacuate The Dancefloor - Radio Edit | Evacuate The Dancefloor | 21 |
| 3785 | Da Art of Storytellin' (Pt. 1) | Da Art of Storytellin' (Pt. 1) | 21 |
| 10181 | Another Life | Another Life | 11 |
| 9658 | Blue Tacoma | Blue Tacoma | 10 |
| 3692 | A Spoonful of Sugar - From "Mary Poppins" / So... | A Spoonful of Sugar | 16 |
| 5032 | I Can See Clearly Now | I Can See Clearly Now | 17 |
| 4151 | Un Desengaño | Un Desengaño | 11 |
| 5181 | Don't Forget Me | Don't Forget Me | 12 |
| 5440 | Figure.09 | Figure.09 | 8 |
| 10969 | Vegas (From the Original Motion Picture Soundt... | Vegas (From the Original Motion Picture Soundt... | 48 |
| 556 | Moonglow | Moonglow | 8 |
| 8073 | Inténtalo (feat. América Sierra & El Bebeto) | Inténtalo | 9 |
| 1378 | The Ghetto | The Ghetto | 9 |
| 2984 | Secret Garden | Secret Garden | 12 |
| 3973 | Así Fue - En Vivo [Desde el Instituto Nacional... | Así Fue | 6 |
| 3994 | the city | the city | 7 |
| 1140 | Seven Year Ache | Seven Year Ache | 13 |
| 1466 | Tiempo De Vals | Tiempo De Vals | 12 |
| 3968 | Feelin' Good Again | Feelin' Good Again | 15 |
| 10350 | GIVE HEAVEN SOME HELL | GIVE HEAVEN SOME HELL | 18 |
| 9678 | El Color de Tus Ojos | El Color de Tus Ojos | 16 |
| 9137 | Party Monster | Party Monster | 12 |
| 7615 | Rolling in the Deep | Rolling in the Deep | 16 |
| 2684 | Hang on to Your Love | Hang on to Your Love | 16 |
| 9271 | walk away as the door slams | walk away as the door slams | 22 |
| 1369 | Something's Gotta Give | Something's Gotta Give | 19 |
| 10530 | BookBag 2.0 (feat. Polo G) | BookBag 2.0 | 9 |
| 2317 | Take The Power Back | Take The Power Back | 16 |
| 2018 | Drivin' My Life Away | Drivin' My Life Away | 16 |
| 321 | Welcome To The Jungle | Welcome To The Jungle | 18 |
| 5605 | Ain't No Use in Tryin' | Ain't No Use in Tryin' | 16 |
| 10333 | Romantic Lover | Romantic Lover | 13 |
| 7207 | Help I'm Alive | Help I'm Alive | 11 |
Okay so it is not perfect, there are some weird exceptions but that is okay. I got most of the stuff out. Let me look at the full data set again just for clarity.¶
df.sample(10, random_state=46)
| track_id | track_name | clean_track_name | letters_track_name | popularity | available_markets | disc_number | duration_ms | explicit | track_number | ... | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | year | duration_min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10915 | 3k3NWokhRRkEPhCzPmV8TW | Ojitos Lindos | Ojitos Lindos | 12 | 89 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 258298 | False | 14 | ... | 3.0 | 0.5280 | -5.745 | 0.0 | 0.0413 | 79.928 | 4.0 | 0.268 | 2022 | 4.304967 |
| 3950 | 1fotoYONO343JjbC8XvPSl | Moment Of Truth | Moment Of Truth | 13 | 59 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 247333 | True | 8 | ... | 6.0 | 0.3710 | -5.282 | 0.0 | 0.2560 | 90.215 | 4.0 | 0.594 | 1998 | 4.122217 |
| 4714 | 6uRH1qMz30ZBwwUG0IYE5s | Dance With Me | Dance With Me | 11 | 65 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 231800 | False | 2 | ... | 9.0 | 0.0442 | -6.365 | 1.0 | 0.2030 | 101.628 | 4.0 | 0.611 | 2001 | 3.863333 |
| 6885 | 4gzeYkzuzxuzAUTsGcdjqA | It Won't Be Like This For Long | It Won't Be Like This For Long | 23 | 63 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 219000 | False | 8 | ... | 4.0 | 0.1100 | -6.001 | 1.0 | 0.0287 | 143.863 | 4.0 | 0.389 | 2008 | 3.650000 |
| 5544 | 3xrn9i8zhNZsTtcoWgQEAd | Since U Been Gone | Since U Been Gone | 14 | 75 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 188960 | False | 2 | ... | 0.0 | 0.1140 | -5.406 | 1.0 | 0.0334 | 131.000 | 4.0 | 0.404 | 2004 | 3.149333 |
| 8017 | 24LS4lQShWyixJ0ZrJXfJ5 | Sweet Nothing (feat. Florence Welch) | Sweet Nothing | 12 | 76 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 212560 | False | 10 | ... | 8.0 | 0.0567 | -3.942 | 0.0 | 0.1090 | 127.934 | 4.0 | 0.582 | 2012 | 3.542667 |
| 10395 | 3hLuHKzG1cmlRpq53ZVWd8 | The Good Ones | The Good Ones | 11 | 76 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 215140 | False | 7 | ... | 2.0 | 0.1490 | -5.023 | 1.0 | 0.0259 | 89.957 | 4.0 | 0.331 | 2020 | 3.585667 |
| 7749 | 725NSbIej5lP3GfhLC7So3 | 115 | 115 | 3 | 68 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 228160 | False | 14 | ... | 8.0 | 0.1020 | -3.808 | 1.0 | 0.1750 | 200.215 | 4.0 | 0.676 | 2011 | 3.802667 |
| 8224 | 4sebUbjqbcgDSwG6PbSGI0 | Come a Little Closer | Come a Little Closer | 17 | 75 | AU;CA;CY;DO;HK;IS;MT;MX;NZ;TW;TR;US;ID;JP;RO;I... | 1 | 229346 | False | 2 | ... | 1.0 | 0.1510 | -1.786 | 0.0 | 0.0327 | 148.038 | 4.0 | 0.304 | 2013 | 3.822433 |
| 3530 | 6cKWDVak6o362TElLvwtmU | Suavecito Suavecito | Suavecito Suavecito | 18 | 67 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 1 | 225720 | False | 4 | ... | 0.0 | 0.0621 | -11.398 | 1.0 | 0.0413 | 96.267 | 4.0 | 0.923 | 1997 | 3.762000 |
10 rows × 32 columns
Looks gorgeous :)¶
Calculate how many countries the song is available in¶
# count the markets by taking everything between semicolons and making it a string ex: 'US', then count the amount of strings within this list
market_count = df['available_markets'].str.split(';').str.len()
# Remove existing column if it exists
if 'market_count' in df.columns:
df.drop(columns='market_count', inplace=True)
# Insert next to markets for readibility
df.insert(df.columns.get_loc('available_markets') + 1, 'market_count', market_count)
# Check dataset
df.sample(10, random_state=14)
| track_id | track_name | clean_track_name | letters_track_name | popularity | available_markets | market_count | disc_number | duration_ms | explicit | ... | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | year | duration_min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 443 | 7KA6U0WOHdGxWWLGPYN2Sb | On the Turning Away | On the Turning Away | 16 | 58 | AR;AU;BO;BR;CA;CL;CO;CR;DO;EC;SV;GT;HN;HK;MY;M... | 128.0 | 1 | 341666 | False | ... | 7.0 | 0.0760 | -12.135 | 1.0 | 0.0300 | 100.454 | 4.0 | 0.2440 | 1987 | 5.694433 |
| 827 | 1Xf1lWBSml62NG1du3Ro14 | Just The Way You Are | Just The Way You Are | 16 | 66 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 184.0 | 1 | 289106 | False | ... | 9.0 | 0.1490 | -14.630 | 1.0 | 0.0494 | 74.948 | 4.0 | 0.3090 | 1988 | 4.818433 |
| 877 | 64IOxX7fXk89bMG2831w4G | Goodbye Time | Goodbye Time | 11 | 47 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 184.0 | 1 | 207533 | False | ... | 2.0 | 0.1110 | -17.392 | 1.0 | 0.0317 | 147.117 | 4.0 | 0.2710 | 1988 | 3.458883 |
| 6849 | 4LloVtxNZpeh7q7xdi1DQc | Free Fallin' - Live at the Nokia Theatre, Los ... | Free Fallin' | 10 | 75 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 184.0 | 1 | 263573 | False | ... | 5.0 | 0.7020 | -13.262 | 1.0 | 0.0315 | 173.509 | 4.0 | 0.4350 | 2008 | 4.392883 |
| 2083 | 00QAndVDVfNqNWYdWAhEan | Who Wants To Live Forever - Remastered 2011 | Who Wants To Live Forever | 21 | 50 | CA;US | 2.0 | 1 | 295346 | False | ... | 4.0 | 0.1150 | -8.670 | 0.0 | 0.0361 | 87.476 | 5.0 | 0.0383 | 1991 | 4.922433 |
| 9823 | 4MXhiYIRDMGAuvZc5IFTwC | ASTROTHUNDER | ASTROTHUNDER | 12 | 78 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 184.0 | 1 | 142969 | True | ... | 4.0 | 0.1110 | -8.111 | 0.0 | 0.0661 | 151.901 | 4.0 | 0.2860 | 2018 | 2.382817 |
| 7694 | 0Uybrtb766jul6WpkjqbID | Hard Times | Hard Times | 9 | 70 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 176.0 | 1 | 178160 | False | ... | 7.0 | 0.1020 | -13.641 | 1.0 | 0.0333 | 133.230 | 3.0 | 0.4000 | 2011 | 2.969333 |
| 7203 | 76LGCP0g9nVknR7HD2Jjyp | Not The American Average | Not The American Average | 21 | 65 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 183.0 | 1 | 279226 | True | ... | 2.0 | 0.2780 | -4.106 | 0.0 | 0.1930 | 184.975 | 4.0 | 0.0676 | 2009 | 4.653767 |
| 5387 | 5TpaWJKnuyA4MjzAbFXSTQ | Damn! (feat. Lil' Jon) - Club Mix | Damn! (feat. Lil' Jon) | 14 | 59 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;E... | 183.0 | 1 | 298600 | True | ... | 6.0 | 0.3550 | -5.799 | 0.0 | 0.3070 | 84.038 | 4.0 | 0.5180 | 2003 | 4.976667 |
| 4454 | 4oPNN7syJYSjzDhRerF966 | Untitled (How Does It Feel) | Untitled (How Does It Feel) | 21 | 60 | AR;AU;AT;BE;BO;BR;BG;CA;CL;CO;CR;CY;CZ;DK;DO;D... | 184.0 | 1 | 430800 | False | ... | 4.0 | 0.0828 | -9.975 | 0.0 | 0.0626 | 112.618 | 3.0 | 0.2210 | 2000 | 7.180000 |
10 rows × 33 columns
Count how many artists made/are on the track¶
df['artists_names'].sample(5, random_state=16)
4930 Trapt 5167 Beck 6827 Angel Y Khriz;Gocho "El Lápiz De Platino";John... 10428 DJ Scheme;Ski Mask The Slump God;Danny Towers;... 322 INXS Name: artists_names, dtype: object
Looks like I can do the same thing I did for markets¶
df.columns.get_loc('artists_names')
15
# count the artists by taking everything between semicolons and making it a string ex: 'Taylor Swift', then count the amount of strings within this list
artist_numb = df['artists_names'].str.split(';').str.len()
# Remove existing column if it exists
if 'artist_numb' in df.columns:
df.drop(columns='artist_numb', inplace=True)
# Insert next to markets for readibility
df.insert(df.columns.get_loc('artists_names') + 1, 'artist_numb', artist_numb)
df.iloc[:, 16:21].sample(5, random_state=16) #look at the added column to check (before "," = all rows, show all the surrounding columns)
| artist_numb | principal_artist_name | artist_genres | principal_artist_followers | acousticness | |
|---|---|---|---|---|---|
| 4930 | 1 | Trapt | alternative metal;nu metal;post-grunge | 1072711.0 | 0.000172 |
| 5167 | 1 | Beck | alternative rock;anti-folk;permanent wave;rock | 1466809.0 | 0.260000 |
| 6827 | 3 | Angel Y Khriz | latin hip hop;reggaeton | 1008284.0 | 0.184000 |
| 10428 | 4 | DJ Scheme | viral rap | 254790.0 | 0.074000 |
| 322 | 1 | INXS | australian rock;dance rock;funk rock;new roman... | 2457267.0 | 0.071300 |
Now calculate how many genres the main artist produces in¶
df['artist_genres'].sample(5, random_state=16)
4930 alternative metal;nu metal;post-grunge 5167 alternative rock;anti-folk;permanent wave;rock 6827 latin hip hop;reggaeton 10428 viral rap 322 australian rock;dance rock;funk rock;new roman... Name: artist_genres, dtype: object
Again, looks like I can do the same thing I did for markets¶
df.columns.get_loc('artist_genres')
18
# count the genres by taking everything between semicolons and making it a string ex: 'Pop', then count the amount of strings within this list
genres_numb = df['artist_genres'].str.split(';').str.len()
# Remove existing column if it exists
if 'genres_numb' in df.columns:
df.drop(columns='genres_numb', inplace=True)
# Insert next to markets for readibility
df.insert(df.columns.get_loc('artist_genres') + 1, 'genres_numb', genres_numb)
df.iloc[:, 19:24].sample(5, random_state=30) #look at the added column to check (before "," = all rows, show all the columns)
| genres_numb | principal_artist_followers | acousticness | danceability | energy | |
|---|---|---|---|---|---|
| 9624 | 2.0 | 50572176.0 | 0.0185 | 0.704 | 0.859 |
| 2879 | 5.0 | 683645.0 | 0.0287 | 0.754 | 0.785 |
| 398 | 8.0 | 632734.0 | 0.0139 | 0.927 | 0.832 |
| 11374 | 1.0 | 6957293.0 | 0.2800 | 0.415 | 0.573 |
| 8315 | 3.0 | 5646381.0 | 0.0737 | 0.343 | 0.536 |
Now let's look at the album release date and do some stuff with that¶
df['album_release_date'].sample(5, random_state=16)
4930 2002-11-05 00:00:00 5167 2002-01-01 00:00:00 6827 2008-01-01 00:00:00 10428 2020-12-04 00:00:00 322 1987-01-01 00:00:00 Name: album_release_date, dtype: object
Let's figure out the month and day of the week they were released in. Right now we only have this format 1986-10-14 00:00:00.¶
df.columns.get_loc('album_release_date')
12
# change column is datetime
df['album_release_date'] = pd.to_datetime(df['album_release_date'])
# Create new columns
df['release_month'] = df['album_release_date'].dt.month
df['release_weekday'] = df['album_release_date'].dt.weekday
#move columns to be next to the original album release date one
month_col = df.pop('release_month')
df.insert(df.columns.get_loc('album_release_date') + 1, 'release_month', month_col)
weekday_col = df.pop('release_weekday')
df.insert(df.columns.get_loc('album_release_date') + 2, 'release_weekday', weekday_col)
df.iloc[:, 10:16].sample(5, random_state=30) #look at the added column to check (before "," = all rows, show all the columns)
| track_number | album_name | album_release_date | release_month | release_weekday | album_type | |
|---|---|---|---|---|---|---|
| 9624 | 1 | Finesse (Remix) [feat. Cardi B] | 2017-12-20 | 12 | 2 | single |
| 2879 | 4 | Labcabincalifornia (Deluxe Edition) | 1995-01-01 | 1 | 6 | album |
| 398 | 13 | The Whispers: Greatest Hits | 1987-01-01 | 1 | 3 | compilation |
| 11374 | 36 | One Thing At A Time | 2023-03-03 | 3 | 4 | album |
| 8315 | 8 | PARTYNEXTDOOR | 2013-07-01 | 7 | 0 | album |
Okay cool. Lastly, release year is at the end of the data set right now but it kinda goes with these things so I'm gonna move it.¶
release_year = df.pop('year')
df.insert(df.columns.get_loc('album_release_date') + 3, 'year', release_year)
df.iloc[:, 10:16].sample(5, random_state=30) #check again
| track_number | album_name | album_release_date | release_month | release_weekday | year | |
|---|---|---|---|---|---|---|
| 9624 | 1 | Finesse (Remix) [feat. Cardi B] | 2017-12-20 | 12 | 2 | 2017 |
| 2879 | 4 | Labcabincalifornia (Deluxe Edition) | 1995-01-01 | 1 | 6 | 1995 |
| 398 | 13 | The Whispers: Greatest Hits | 1987-01-01 | 1 | 3 | 1987 |
| 11374 | 36 | One Thing At A Time | 2023-03-03 | 3 | 4 | 2023 |
| 8315 | 8 | PARTYNEXTDOOR | 2013-07-01 | 7 | 0 | 2013 |
Okay, now I feel like our data set has some good information! Let's just check for N/As to clean it up completely.¶
df.isna().sum()
track_id 0 track_name 0 clean_track_name 0 letters_track_name 0 popularity 0 available_markets 1 market_count 1 disc_number 0 duration_ms 0 explicit 0 track_number 0 album_name 0 album_release_date 0 release_month 0 release_weekday 0 year 0 album_type 0 album_total_tracks 0 artists_names 0 artist_numb 0 principal_artist_name 0 artist_genres 105 genres_numb 105 principal_artist_followers 0 acousticness 5 danceability 5 energy 5 instrumentalness 5 key 5 liveness 5 loudness 5 mode 5 speechiness 5 tempo 5 time_signature 5 valence 5 duration_min 0 dtype: int64
For genres I think I will just put the median because I don't want to drop all those rows but for the others I think I will drop them for ease¶
print(df['genres_numb'].median())
df['genres_numb'] = df['genres_numb'].fillna(df['genres_numb'].median())
3.0
I am just gonna drop the rest now¶
df = df.dropna()
Let's check¶
df.isna().sum()
track_id 0 track_name 0 clean_track_name 0 letters_track_name 0 popularity 0 available_markets 0 market_count 0 disc_number 0 duration_ms 0 explicit 0 track_number 0 album_name 0 album_release_date 0 release_month 0 release_weekday 0 year 0 album_type 0 album_total_tracks 0 artists_names 0 artist_numb 0 principal_artist_name 0 artist_genres 0 genres_numb 0 principal_artist_followers 0 acousticness 0 danceability 0 energy 0 instrumentalness 0 key 0 liveness 0 loudness 0 mode 0 speechiness 0 tempo 0 time_signature 0 valence 0 duration_min 0 dtype: int64
BEAUTIFUL!!!!¶
Run a Multiple Linear Regression to see which variables might affect song popularity!¶
print(df.columns)
Index(['track_id', 'track_name', 'clean_track_name', 'letters_track_name',
'popularity', 'available_markets', 'market_count', 'disc_number',
'duration_ms', 'explicit', 'track_number', 'album_name',
'album_release_date', 'release_month', 'release_weekday', 'year',
'album_type', 'album_total_tracks', 'artists_names', 'artist_numb',
'principal_artist_name', 'artist_genres', 'genres_numb',
'principal_artist_followers', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
'speechiness', 'tempo', 'time_signature', 'valence', 'duration_min'],
dtype='object')
All of my columns need to be numeric variables for this to work¶
df.dtypes
track_id object track_name object clean_track_name object letters_track_name int64 popularity int64 available_markets object market_count float64 disc_number int64 duration_ms int64 explicit bool track_number int64 album_name object album_release_date datetime64[ns] release_month int64 release_weekday int64 year int64 album_type object album_total_tracks int64 artists_names object artist_numb int64 principal_artist_name object artist_genres object genres_numb float64 principal_artist_followers float64 acousticness float64 danceability float64 energy float64 instrumentalness float64 key float64 liveness float64 loudness float64 mode float64 speechiness float64 tempo float64 time_signature float64 valence float64 duration_min float64 dtype: object
irrelevant = ['track_id', 'track_name', 'clean_track_name', 'available_markets', 'album_name', 'album_release_date', 'artists_names', 'principal_artist_name', 'artist_genres'] #remove variables that won't be helpful anymore based on the calculations we made
df_regression = df.drop(columns=irrelevant) #drop them
#convert my categorical columns to numeric
df_regression['explicit'] = df_regression['explicit'].astype(int)
df_regression['album_type'] = df_regression['album_type'].astype('category').cat.codes
pred = df_regression.drop(columns=['popularity']) #drop target and keep all other columns to be predictors
y = df_regression['popularity'] #target
#check that my "numeric" columns actually are
pred.dtypes.head(50)
letters_track_name int64 market_count float64 disc_number int64 duration_ms int64 explicit int64 track_number int64 release_month int64 release_weekday int64 year int64 album_type int8 album_total_tracks int64 artist_numb int64 genres_numb float64 principal_artist_followers float64 acousticness float64 danceability float64 energy float64 instrumentalness float64 key float64 liveness float64 loudness float64 mode float64 speechiness float64 tempo float64 time_signature float64 valence float64 duration_min float64 dtype: object
BEAUTIFUL!!!!¶
Now I get to run my model FINALLY¶
#one package I could use
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(pred, y)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
import statsmodels.api as sm
# Add a constant column for the intercept
pred_sm = sm.add_constant(pred)
# Fit the OLS regression
model = sm.OLS(y, pred_sm).fit()
# Print the full summary
print(model.summary2())
Results: Ordinary least squares
=================================================================================
Model: OLS Adj. R-squared: 0.506
Dependent Variable: popularity AIC: 66659.1416
Date: 2025-12-10 13:51 BIC: 66854.1402
No. Observations: 10119 Log-Likelihood: -33303.
Df Model: 26 F-statistic: 398.9
Df Residuals: 10092 Prob (F-statistic): 0.00
R-squared: 0.507 Scale: 42.393
---------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const -875.2074 16.7820 -52.1515 0.0000 -908.1035 -842.3113
letters_track_name -0.0404 0.0104 -3.8753 0.0001 -0.0609 -0.0200
market_count 0.0305 0.0011 26.5829 0.0000 0.0282 0.0327
disc_number -0.1517 0.4035 -0.3760 0.7069 -0.9427 0.6392
duration_ms 0.0000 0.0000 0.9626 0.3358 -0.0000 0.0000
explicit -0.1675 0.1895 -0.8838 0.3768 -0.5390 0.2040
track_number -0.0786 0.0158 -4.9835 0.0000 -0.1095 -0.0477
release_month 0.1007 0.0183 5.5097 0.0000 0.0649 0.1366
release_weekday 0.2183 0.0379 5.7628 0.0000 0.1441 0.2926
year 0.4687 0.0083 56.7773 0.0000 0.4525 0.4849
album_type 0.4337 0.1281 3.3868 0.0007 0.1827 0.6847
album_total_tracks -0.0015 0.0084 -0.1766 0.8598 -0.0179 0.0149
artist_numb 0.1398 0.0876 1.5962 0.1105 -0.0319 0.3115
genres_numb -0.0818 0.0355 -2.3056 0.0212 -0.1514 -0.0123
principal_artist_followers 0.0000 0.0000 20.0420 0.0000 0.0000 0.0000
acousticness 0.7609 0.3353 2.2691 0.0233 0.1036 1.4183
danceability 2.1622 0.5200 4.1581 0.0000 1.1429 3.1814
energy 0.8040 0.5807 1.3845 0.1662 -0.3343 1.9423
instrumentalness 0.1241 0.4307 0.2881 0.7733 -0.7202 0.9684
key -0.0245 0.0184 -1.3296 0.1837 -0.0605 0.0116
liveness -1.3275 0.4552 -2.9165 0.0035 -2.2197 -0.4353
loudness 0.1274 0.0289 4.4123 0.0000 0.0708 0.1840
mode -0.7832 0.1426 -5.4939 0.0000 -1.0626 -0.5037
speechiness -6.6990 0.8265 -8.1051 0.0000 -8.3192 -5.0789
tempo -0.0040 0.0022 -1.7707 0.0766 -0.0083 0.0004
time_signature -0.1837 0.1863 -0.9863 0.3240 -0.5488 0.1814
valence -0.3764 0.3482 -1.0810 0.2797 -1.0590 0.3062
duration_min 0.0000 0.0000 1.2872 0.1981 -0.0000 0.0000
---------------------------------------------------------------------------------
Omnibus: 391.996 Durbin-Watson: 0.829
Prob(Omnibus): 0.000 Jarque-Bera (JB): 438.736
Skew: 0.510 Prob(JB): 0.000
Kurtosis: 3.032 Condition No.: 11332367532800962
=================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The smallest eigenvalue is 3.24e-14. This might indicate that
there are strong multicollinearity problems or that the design
matrix is singular.
#look at it all without it truncating
summary_df = pd.DataFrame({
'coef': model.params,
'std_err': model.bse,
't': model.tvalues,
'p_value': model.pvalues
})
pd.set_option('display.float_format', '{:.6f}'.format) #sets it to not go into scientfiic notation
print(summary_df.to_string())
coef std_err t p_value const -875.207385 16.782011 -52.151521 0.000000 letters_track_name -0.040449 0.010438 -3.875262 0.000107 market_count 0.030465 0.001146 26.582909 0.000000 disc_number -0.151736 0.403504 -0.376047 0.706890 duration_ms 0.000001 0.000001 0.962634 0.335754 explicit -0.167513 0.189534 -0.883818 0.376816 track_number -0.078553 0.015763 -4.983506 0.000001 release_month 0.100728 0.018282 5.509739 0.000000 release_weekday 0.218341 0.037888 5.762796 0.000000 year 0.468707 0.008255 56.777259 0.000000 album_type 0.433719 0.128062 3.386797 0.000710 album_total_tracks -0.001480 0.008378 -0.176633 0.859800 artist_numb 0.139821 0.087593 1.596248 0.110465 genres_numb -0.081814 0.035486 -2.305554 0.021156 principal_artist_followers 0.000000 0.000000 20.041984 0.000000 acousticness 0.760938 0.335347 2.269103 0.023283 danceability 2.162157 0.519986 4.158102 0.000032 energy 0.803984 0.580719 1.384463 0.166247 instrumentalness 0.124082 0.430733 0.288072 0.773297 key -0.024465 0.018401 -1.329601 0.183680 liveness -1.327505 0.455177 -2.916458 0.003548 loudness 0.127407 0.028875 4.412330 0.000010 mode -0.783161 0.142552 -5.493875 0.000000 speechiness -6.699014 0.826517 -8.105109 0.000000 tempo -0.003962 0.002237 -1.770659 0.076648 time_signature -0.183711 0.186267 -0.986281 0.324019 valence -0.376430 0.348239 -1.080954 0.279743 duration_min 0.000000 0.000000 1.287185 0.198059
Oh wow so quite a few of them are significant predictors and my R squared is decently high. Seems promising to me!¶
List of variables with a p value less than 0.05 -- statistically significant contributors!!!!
letters_track_name
market_count
track_number
release_month
release_weekday
year
album_type
genres_numb
principal_artist_followers
acousticness
danceability
liveness
loudness
mode
speechiness
Let's check for multicollinearity just to make sure¶
from statsmodels.stats.outliers_influence import variance_inflation_factor
# add constant/beta 0/intercept added
pred_vif = sm.add_constant(pred)
# Compute VIF for each column
vif_df = pd.DataFrame({
"feature": pred_vif.columns,
"VIF": [variance_inflation_factor(pred_vif.values, i) #values convert to numpy array
for i in range(pred_vif.shape[1])] #shape takes all the columns and subtracts 1 so it counts the indexes accurately
}) #so basically go through all calumns and calculate the variance
print(vif_df)
feature VIF 0 const 67225.254413 1 letters_track_name 1.034509 2 market_count 1.050163 3 disc_number 1.256947 4 duration_ms inf 5 explicit 1.609970 6 track_number 1.166390 7 release_month 1.138891 8 release_weekday 1.142766 9 year 1.953558 10 album_type 1.242710 11 album_total_tracks 1.472709 12 artist_numb 1.088994 13 genres_numb 1.201645 14 principal_artist_followers 1.208237 15 acousticness 1.837259 16 danceability 1.703249 17 energy 3.412144 18 instrumentalness 1.211604 19 key 1.020806 20 liveness 1.055523 21 loudness 2.832940 22 mode 1.057183 23 speechiness 1.448800 24 tempo 1.089438 25 time_signature 1.066449 26 valence 1.746087 27 duration_min inf
Looks pretty good to me!¶
Some explanations about the VIF stuff¶
- What is VIF?
- Variance Inflation Factor (VIF) is a statistical measure used in regression analysis to identify multicollinearity
- A VIF of 1 means no correlation, while a VIF exceeding 10 can signal serious multicollinearity
- A model with a low VIF is more stable because its predictors are less influenced by each other
- Infinite VIF
- duration_ms: VIF = infinity !!!
- duration_ms is perfectly collinear with one or more other predictors or it has near-zero variance
- this make sense because duration_ms and duration_min are the same numbers but converted unit-wise
- Most other predictors with VIF < 5 are Totally acceptable to use in our simulation! - - We know they are not colinear
- liveness
- mode
- speechiness
- tempo
- time_signature
Next Steps¶
Possible Ideas (won't do all of them)¶
- for each predictor, find the range of values so we know what boundaries to simulate within
- simulate a whole bunch of songs and see trends, what makes something more popular
- take like top 10% or something and see what they have in common
- graphs of course
- maybe map two features against eachother
- maybe we could add a random luck factor for going viral
- likelihoood it goes viral on tiktok or it gets put in a show/movie or some things like that
- maybe have another simulation where you input your own song and it generates a popularity score
- then you can change one variable and see how it affects the score
- could make a distribution for teh same song but from 1986 to 2023 to see the change over time
- could do an agent where there are certain probabilities that someone shares a song
Start OOP¶
Goal: start by simulating the features of one song, then simulate the popularity of that one song, then we can run that simulation a whole bunch of times and start making some visulaizations¶
Maybe let's start by finding the ranges/probabilities of the data we already have so that we can simulate new songs that are realistic¶
print(pred.columns)
Index(['letters_track_name', 'market_count', 'disc_number', 'duration_ms',
'explicit', 'track_number', 'release_month', 'release_weekday', 'year',
'album_type', 'album_total_tracks', 'artist_numb', 'genres_numb',
'principal_artist_followers', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
'speechiness', 'tempo', 'time_signature', 'valence', 'duration_min'],
dtype='object')
continuous_cols = ['letters_track_name', 'market_count', 'disc_number', 'duration_ms',
'explicit', 'track_number', 'release_month', 'release_weekday', 'year',
'album_type', 'album_total_tracks', 'artist_numb', 'genres_numb',
'principal_artist_followers', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
'speechiness', 'tempo', 'time_signature', 'valence', 'duration_min']
cont_min = pred[continuous_cols].min()
cont_max = pred[continuous_cols].max()
print(cont_max)
print(cont_min)
letters_track_name 76.000000 market_count 184.000000 disc_number 10.000000 duration_ms 2238733.000000 explicit 1.000000 track_number 48.000000 release_month 12.000000 release_weekday 6.000000 year 2023.000000 album_type 2.000000 album_total_tracks 176.000000 artist_numb 40.000000 genres_numb 15.000000 principal_artist_followers 114675033.000000 acousticness 0.996000 danceability 0.988000 energy 1.000000 instrumentalness 1.000000 key 11.000000 liveness 0.982000 loudness 0.522000 mode 1.000000 speechiness 0.944000 tempo 220.099000 time_signature 5.000000 valence 0.994000 duration_min 37.312217 dtype: float64 letters_track_name 0.000000 market_count 1.000000 disc_number 1.000000 duration_ms 33493.000000 explicit 0.000000 track_number 1.000000 release_month 1.000000 release_weekday 0.000000 year 1986.000000 album_type 0.000000 album_total_tracks 1.000000 artist_numb 1.000000 genres_numb 1.000000 principal_artist_followers 100.000000 acousticness 0.000000 danceability 0.000000 energy 0.000020 instrumentalness 0.000000 key 0.000000 liveness 0.000000 loudness -47.070000 mode 0.000000 speechiness 0.000000 tempo 0.000000 time_signature 0.000000 valence 0.000000 duration_min 0.558217 dtype: float64
## prediction function
def Predict_Popularity(predictor_list, model):
"""
Predicts track popularity using a list of predictor values
corresponding to the significant predictors in the reduced model.
Parameters
----------
predictor_list : list
A list of values in the correct order of the significant predictors.
model : fitted statsmodels OLS model
Returns
-------
float : predicted popularity score
"""
# Significant predictors in correct order
predictor_names = [
"letters_track_name",
"market_count",
"track_number",
"release_month",
"release_weekday",
"year",
"album_type",
"genres_numb",
"principal_artist_followers",
"acousticness",
"danceability",
"liveness",
"loudness",
"mode",
"speechiness"
]
# Build input dataframe
input_data = pd.DataFrame([dict(zip(predictor_names, predictor_list))])
# Predict
prediction = model.predict(input_data)
return float(prediction.iloc[0])
significant_predictors = [
"letters_track_name",
"market_count",
"track_number",
"release_month",
"release_weekday",
"year",
"genres_numb",
"principal_artist_followers",
"acousticness",
"danceability",
"liveness",
"loudness",
"mode",
"speechiness"
]
I decided to redo the cleaning / OLS model because I wanted to reduce the amount of explanatory variables we were using
I only used the variables that were deemed significant from the earlier OLS model!
df_clean = df[significant_predictors].copy()
# Convert numeric columns cleanly
df_clean = df_clean.apply(pd.to_numeric, errors="coerce")
# Remove any NaN rows
df_clean = df_clean.dropna()
# Align y
y_clean = df.loc[df_clean.index, "popularity"]
# Fit reduced model
X_reduced = sm.add_constant(df_clean)
model_reduced = sm.OLS(y_clean, X_reduced).fit()
print(model_reduced.summary())
OLS Regression Results
==============================================================================
Dep. Variable: popularity R-squared: 0.506
Model: OLS Adj. R-squared: 0.505
Method: Least Squares F-statistic: 737.8
Date: Wed, 10 Dec 2025 Prob (F-statistic): 0.00
Time: 13:52:01 Log-Likelihood: -33316.
No. Observations: 10119 AIC: 6.666e+04
Df Residuals: 10104 BIC: 6.677e+04
Df Model: 14
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -885.2613 14.748 -60.027 0.000 -914.170 -856.353
letters_track_name -0.0381 0.010 -3.669 0.000 -0.058 -0.018
market_count 0.0305 0.001 26.700 0.000 0.028 0.033
track_number -0.0908 0.015 -6.102 0.000 -0.120 -0.062
release_month 0.1017 0.018 5.569 0.000 0.066 0.137
release_weekday 0.2143 0.038 5.662 0.000 0.140 0.289
year 0.4735 0.007 64.512 0.000 0.459 0.488
genres_numb -0.0851 0.035 -2.422 0.015 -0.154 -0.016
principal_artist_followers 7.894e-08 3.9e-09 20.222 0.000 7.13e-08 8.66e-08
acousticness 0.6054 0.293 2.064 0.039 0.030 1.180
danceability 1.8533 0.418 4.429 0.000 1.033 2.674
liveness -1.2798 0.452 -2.832 0.005 -2.166 -0.394
loudness 0.1337 0.021 6.330 0.000 0.092 0.175
mode -0.7934 0.141 -5.628 0.000 -1.070 -0.517
speechiness -6.8152 0.734 -9.287 0.000 -8.254 -5.377
==============================================================================
Omnibus: 394.854 Durbin-Watson: 0.834
Prob(Omnibus): 0.000 Jarque-Bera (JB): 442.314
Skew: 0.512 Prob(JB): 8.97e-97
Kurtosis: 3.030 Cond. No. 4.62e+09
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.62e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
def simulate_song_features(df, predictors):
"""
Randomly generate one simulated song by sampling from the empirical
distributions of each predictor.
"""
sim = {}
for col in predictors:
sim[col] = np.random.choice(df[col].values)
return sim
def predict_popularity(sim_features, model):
"""
Takes a simulated dict of features and returns predicted popularity.
"""
df_input = pd.DataFrame([sim_features])
df_input = sm.add_constant(df_input, has_constant="add")
return model.predict(df_input)[0]
def simulate_one_song(df, predictors, model):
song = simulate_song_features(df, predictors)
popularity = predict_popularity(song, model)
return song, popularity
sim_results = []
N = 5000 # number of simulated songs
for i in range(N):
song, popularity = simulate_one_song(df_clean, significant_predictors, model_reduced)
song["predicted_popularity"] = popularity
sim_results.append(song)
sim_df = pd.DataFrame(sim_results)
plt.figure(figsize=(10,5))
sns.histplot(sim_df["predicted_popularity"], bins=40, kde=True)
plt.title("Distribution of Simulated Song Popularity")
plt.xlabel("Predicted Popularity Score")
plt.ylabel("Count")
plt.show()
I took a data visualizations course and wanted to see an interactive graph of the variabes in comparison!¶
First let's see how danceability affects popularity
import altair as alt
sim_sample = sim_df.sample(400)
sns.pairplot(sim_sample, diag_kind="kde", corner=True)
plt.suptitle("Pairplot of Simulated Song Features", y=1.02, fontsize=14)
plt.show()
Which variables might be correlated?¶
Using a pairplot we can easily see what variables have a linear relationship with the predicted popularity.
The clearest linear plot I see is with the year -- I created a zoomed in scatter plot of popularity vs year below using altair.
df_sample = df.sample(400)
alt.data_transformers.disable_max_rows()
# Brush for zoom + pan
brush = alt.selection_interval(bind="scales")
# Create interactive scatterplot
chart = (
alt.Chart(df_sample)
.mark_circle(size=80)
.encode(
x=alt.X("year:Q", bin=alt.Bin(step=2), title="Release Year"),
y=alt.Y("popularity:Q", title="Predicted Popularity"),
tooltip=[
"track_name:N",
"year:Q",
"popularity:Q",
"danceability:Q",
"acousticness:Q"
]
)
.add_params( brush)
.properties(
width=800,
height=700,
title="Predicted Popularity vs. Release Year (Interactive)"
)
)
chart
popularity vs followers¶
above you can see I wanted to see how followers affect popularity, there is not much correlation
def interactive_scatter(df, x_var, y_var, hover_var="track_name"):
"""
Creates an interactive Altair scatterplot comparing two variables,
showing song name (or any hover variable) on hover.
Parameters
----------
df : DataFrame
Your cleaned dataframe.
x_var : str
Column name for x-axis.
y_var : str
Column name for y-axis.
hover_var : str
Variable to show when hovering (default = track_name).
"""
chart = (
alt.Chart(df)
.mark_circle(size=90)
.encode(
x=alt.X(x_var, title=x_var),
y=alt.Y(y_var, title=y_var),
tooltip=[hover_var, x_var, y_var]
)
.interactive()
.properties(
width=600,
height=400,
title=f"{x_var} vs {y_var} (Hover to see {hover_var})"
)
)
return chart
interactive_scatter(df_sample, "danceability", "popularity", hover_var="track_name")
interactive_scatter(df_sample, "tempo", "popularity", hover_var="track_name")
interactive_scatter(df_sample, "time_signature", "popularity", hover_var="track_name")
Some key outliers are "brown noise" and "Box fan sound" which do not have any time signature or tempo, and therefore have no danceability. I also would argue that some songs such as "Always forever" by Cults are not ranked the best in terms of danceability, I think I could dance to that song!
Another Simulations with Added Noise and Virality Variable + New Way of Analyzing Current Variables¶
df['album_type'] = df['album_type'].astype('category').cat.codes.astype('int8')
predictor_cols = [
'letters_track_name', 'market_count', 'disc_number', 'duration_ms',
'explicit', 'track_number', 'release_month', 'release_weekday', 'year',
'album_type', 'album_total_tracks', 'artist_numb', 'genres_numb',
'principal_artist_followers', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
'speechiness', 'tempo', 'time_signature', 'valence', 'duration_min'
]
DEFAULT_N_SIMS = 10000
TOP_PCT = 0.10 # top 10%
RANDOM_SEED = 42
# Viral parameters
VIRAL_ON = True
VIRAL_PROB = 0.01 # 1% chance
VIRAL_BOOST_MEAN = 20.0
VIRAL_BOOST_STD = 5.0
# Noise settings (per-column multiplier of observed std; can be scalar or dict)
NOISE_SCALE = 1.0 # will multiply each column's std by this for Gaussian noise
# Helper: compute ranges & stats
#Returns a DataFrame with min, max, mean, std, and dtype for each predictor column.
def get_predictor_stats(df, cols):
stats = []
for c in cols:
s = df[c]
stats.append({
'col': c,
'min': s.min(),
'max': s.max(),
'mean': s.mean(),
'std': s.std(ddof=0), # population std to match simulation scale
'dtype': s.dtype
})
return pd.DataFrame(stats).set_index('col')
#Sample continuous variable using normal around mean, switch to uniform if std is 0.
def sample_continuous(col_stats, n):
mean, std, mn, mx = col_stats['mean'], col_stats['std'], col_stats['min'], col_stats['max']
if pd.isna(std) or std == 0:
return np.random.uniform(mn, mx, size=n)
else:
samp = np.random.normal(loc=mean, scale=std * NOISE_SCALE, size=n)
# clip to observed min/max
return np.clip(samp, mn, mx)
#Sample integer-like columns by sampling uniform in [min, max] and rounding.
def sample_integer(col_stats, n):
mn, mx = int(col_stats['min']), int(col_stats['max'])
if mn == mx:
return np.full(n, mn, dtype=int)
# sample integers uniformly
return np.random.randint(mn, mx + 1, size=n)
#for things like acousticness/danceability in [0,1] - sample Beta-like behavior using clipped normal
def sample_bounded_ratio(col_stats, n):
mn, mx, mean, std = col_stats['min'], col_stats['max'], col_stats['mean'], col_stats['std']
if std == 0 or pd.isna(std):
return np.clip(np.full(n, mean), mn, mx)
s = np.random.normal(loc=mean, scale=std * NOISE_SCALE, size=n)
return np.clip(s, mn, mx)
Similar Simulation¶
def simulate_songs(df, model, predictor_cols, n_sims=DEFAULT_N_SIMS, seed=RANDOM_SEED,
viral_on=VIRAL_ON, viral_prob=VIRAL_PROB,
viral_boost_mean=VIRAL_BOOST_MEAN, viral_boost_std=VIRAL_BOOST_STD):
"""
Simulate synthetic songs and predict popularity using statsmodels `model`.
Returns the simulated DataFrame with `predicted_popularity` and `predicted_popularity_with_viral`.
"""
np.random.seed(seed)
stats = get_predictor_stats(df, predictor_cols)
sim = pd.DataFrame(index=range(n_sims))
for col in predictor_cols:
cs = stats.loc[col]
dtype = cs['dtype']
# Identify how to sample:
# - many features are bounded 0..1 (acousticness, danceability, etc.)
# - some are integer flags/counts (explicit, disc_number, track_number, album_type)
# - duration_ms, tempo, loudness continuous
if dtype in (np.int8, np.int16, np.int32, np.int64) or col in [
'disc_number', 'explicit', 'track_number', 'release_month', 'release_weekday', 'year',
'album_type', 'album_total_tracks', 'artist_numb', 'key', 'time_signature', 'duration_min'
]:
# integer-like sampling
sim[col] = sample_integer(cs, n_sims)
else:
# float columns - decide if bounded ratio or general continuous
if col in ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'valence', 'mode']:
sim[col] = sample_bounded_ratio(cs, n_sims)
elif col in ['key', 'time_signature']:
sim[col] = sample_integer(cs, n_sims)
else:
sim[col] = sample_continuous(cs, n_sims)
# Ensure types mirror original where appropriate
for col in predictor_cols:
if df[col].dtype in (np.int8, np.int16, np.int32, np.int64):
sim[col] = sim[col].round().astype(int)
# Prepare dataframe for model prediction. statsmodels expects same columns order and a constant if used.
# If model was fit with a constant (safer to add), use sm.add_constant
sim_for_pred = sim.copy()
sim_for_pred = sm.add_constant(sim_for_pred, has_constant='add')
# Use model.predict
try:
sim['predicted_popularity'] = model.predict(sim_for_pred)
except Exception as e:
raise RuntimeError("Failed to predict with provided model. Make sure `model` is a fitted statsmodels OLS results object and predictor columns match.") from e
# Optionally add viral luck factor
sim['predicted_popularity_with_viral'] = sim['predicted_popularity'].copy()
if viral_on:
# determine viral events
viral_events = np.random.rand(n_sims) < viral_prob
viral_boosts = np.random.normal(loc=viral_boost_mean, scale=viral_boost_std, size=n_sims)
viral_boosts = np.where(viral_events, viral_boosts, 0.0)
sim['viral_boost'] = viral_boosts
sim['predicted_popularity_with_viral'] += sim['viral_boost']
else:
sim['viral_boost'] = 0.0
# Clip popularity to realistic bounds if you prefer (e.g., 0-100)
sim['predicted_popularity'] = sim['predicted_popularity'].clip(lower=0, upper=100)
sim['predicted_popularity_with_viral'] = sim['predicted_popularity_with_viral'].clip(lower=0, upper=100)
return sim, stats
# --- Analysis functions ---------------------------------------------------
def analyze_top(sim, pct=TOP_PCT, target_col='predicted_popularity_with_viral'):
"""
Return top fraction of rows and summary statistics comparing top vs rest.
"""
n_top = int(len(sim) * pct)
top = sim.nlargest(n_top, target_col)
rest = sim.drop(top.index)
summary = pd.DataFrame({
'top_mean': top[predictor_cols].mean(),
'rest_mean': rest[predictor_cols].mean(),
'top_median': top[predictor_cols].median(),
'rest_median': rest[predictor_cols].median(),
})
summary['mean_diff'] = summary['top_mean'] - summary['rest_mean']
return top, rest, summary.sort_values('mean_diff', ascending=False)
Plot Functions!¶
# --- Plotting -------------------------------------------------------------
def plot_histogram(sim, col='predicted_popularity_with_viral', bins=50, savepath=None):
plt.figure(figsize=(8,5))
plt.hist(sim[col], bins=bins)
plt.title(f'Histogram of {col}')
plt.xlabel('Predicted popularity')
plt.ylabel('Count')
if savepath:
plt.savefig(savepath, bbox_inches='tight')
plt.show()
def plot_scatter(sim, x, y, target_col='predicted_popularity_with_viral', hue_quantile=None, savepath=None):
plt.figure(figsize=(7,6))
# color by popularity if desired
sc = plt.scatter(sim[x], sim[y], c=sim[target_col], alpha=0.6)
plt.colorbar(sc, label=target_col)
plt.xlabel(x)
plt.ylabel(y)
plt.title(f'{x} vs {y} colored by {target_col}')
if savepath:
plt.savefig(savepath, bbox_inches='tight')
plt.show()
def plot_correlation_heatmap(df_sim, cols=None, savepath=None):
if cols is None:
cols = predictor_cols + ['predicted_popularity_with_viral']
corr = df_sim[cols].corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr, annot=False, cmap='vlag', center=0)
plt.title('Correlation matrix (simulated data)')
if savepath:
plt.savefig(savepath, bbox_inches='tight')
plt.show()
def plot_top_vs_rest_feature(sim, feature, top_idx=None, pct=TOP_PCT, savepath=None):
top, rest, _ = analyze_top(sim, pct=pct)
plt.figure(figsize=(8,5))
sns.kdeplot(rest[feature], label='rest', fill=True)
sns.kdeplot(top[feature], label='top', fill=True)
plt.title(f'Distribution of {feature} - top {int(pct*100)}% vs rest')
plt.legend()
if savepath:
plt.savefig(savepath, bbox_inches='tight')
plt.show()
Run the Simulation¶
if __name__ == "__main__":
# Run simulation (adjust n_sims as needed)
sim_df, stats_df = simulate_songs(df=df, model=model, predictor_cols=predictor_cols, n_sims=DEFAULT_N_SIMS)
print("Simulated shape:", sim_df.shape)
print("Predictor stats:\n", stats_df)
# Basic analysis
top, rest, summary = analyze_top(sim_df, pct=TOP_PCT)
print(f"Top {int(TOP_PCT*100)}% count:", len(top))
display_cols = ['top_mean', 'rest_mean', 'mean_diff']
print("Top vs Rest summary (top mean - rest mean):")
print(summary[display_cols].head(20))
# Plots
plot_histogram(sim_df, col='predicted_popularity_with_viral')
# Two-feature scatter examples (change to features you'd like)
plot_scatter(sim_df, 'danceability', 'energy')
plot_scatter(sim_df, 'principal_artist_followers', 'liveness')
plot_correlation_heatmap(sim_df)
# Visual comparisons for top vs rest for interesting features
for feat in ['danceability', 'energy', 'acousticness', 'genres_numb', 'principal_artist_followers', 'loudness']:
if feat in sim_df.columns:
plot_top_vs_rest_feature(sim_df, feat)
# Save simulated dataset if you want
# sim_df.to_csv('simulated_songs.csv', index=False)
Simulated shape: (10000, 30)
Predictor stats:
min max mean \
col
letters_track_name 0 76 12.510821
market_count 1.000000 184.000000 159.008301
disc_number 1 10 1.016405
duration_ms 33493 2238733 228975.991303
explicit False True 0.250519
track_number 1 48 5.309220
release_month 1 12 5.328194
release_weekday 0 6 2.364463
year 1986 2023 2004.258326
album_type 0 2 0.208914
album_total_tracks 1 176 14.248641
artist_numb 1 40 1.249926
genres_numb 1.000000 15.000000 3.776065
principal_artist_followers 100.000000 114675033.000000 9507767.987449
acousticness 0.000000 0.996000 0.225428
danceability 0.000000 0.988000 0.611739
energy 0.000020 1.000000 0.652720
instrumentalness 0.000000 1.000000 0.045690
key 0.000000 11.000000 5.265441
liveness 0.000000 0.982000 0.183409
loudness -47.070000 0.522000 -7.521953
mode 0.000000 1.000000 0.679020
speechiness 0.000000 0.944000 0.089384
tempo 0.000000 220.099000 121.007037
time_signature 0.000000 5.000000 3.929044
valence 0.000000 0.994000 0.539131
duration_min 0.558217 37.312217 3.816267
std dtype
col
letters_track_name 6.307239 int64
market_count 57.877488 float64
disc_number 0.179841 int64
duration_ms 64891.284072 int64
explicit 0.433312 bool
track_number 4.434761 int64
release_month 3.778310 int64
release_weekday 1.826225 int64
year 10.958820 int64
album_type 0.563434 int8
album_total_tracks 9.375143 int64
artist_numb 0.771115 int64
genres_numb 1.999470 float64
principal_artist_followers 17923152.799975 float64
acousticness 0.261618 float64
danceability 0.162452 float64
energy 0.205885 float64
instrumentalness 0.165405 float64
key 3.554011 float64
liveness 0.146094 float64
loudness 3.772877 float64
mode 0.466853 float64
speechiness 0.094261 float64
tempo 30.195821 float64
time_signature 0.358850 float64
valence 0.245603 float64
duration_min 1.081521 float64
Top 10% count: 1000
Top vs Rest summary (top mean - rest mean):
top_mean rest_mean mean_diff
principal_artist_followers 18167389.212643 12326918.396237 5840470.816406
duration_ms 1304507.636000 1132531.212778 171976.423222
market_count 161.025743 144.572784 16.452959
year 2017.930000 2003.088000 14.842000
artist_numb 26.686000 19.736000 6.950000
album_total_tracks 90.062000 88.821556 1.240444
release_month 6.948000 6.421222 0.526778
loudness -7.083061 -7.601161 0.518101
release_weekday 3.375000 2.971778 0.403222
duration_min 18.661000 18.464667 0.196333
album_type 1.095000 0.971889 0.123111
danceability 0.628153 0.606131 0.022022
acousticness 0.267073 0.249835 0.017238
energy 0.661337 0.647949 0.013388
instrumentalness 0.095568 0.091874 0.003695
explicit 0.496000 0.492667 0.003333
valence 0.532927 0.539260 -0.006332
liveness 0.183195 0.191905 -0.008710
speechiness 0.084891 0.100936 -0.016045
mode 0.598108 0.633682 -0.035575
Very cool!¶
Key Conclusions¶
Overall we found that lots of people tend to listen to music in time_signature 3-4 and some of the best predictors for popularity are artist followers, and the year released. our simulations found that other variables such as time_signature and danceability can also play a role, yet the effect is less due to smaller values of time signature, and the fact that our data was too broad.
Future simulation projects should focus on more specific genres or audiences that we can target, which could provide more meaningful results instead of being so broad like our project.
We hypothesize that popularity is also affected by other variables such as advertisements, and managing teams, which funnel a lot of money in to create a persona for the artist creating the song.
More work can be done to run simulations on how we choose the music we listen to.